1. Overview and scope of application
Applies to: block storage, object storage, and file service nodes deployed in the Singapore region (for example, AWS ap-southeast-1 or Alibaba Cloud Singapore).
Goal: ensure availability, predictable capacity, and actionable, automated alerting. This article uses Prometheus, Grafana, and Alertmanager as the example monitoring stack and includes practical steps for capacity expansion and temporary mitigation.
2. Monitoring item collection and deployment steps (instance level)
Steps: 1) Install node_exporter on each storage server (Debian/Ubuntu): sudo apt update && sudo apt install -y prometheus-node-exporter.
2) Configure the Prometheus scrape job: add the following under scrape_configs in prometheus.yml, replacing ip with the server's address, then restart Prometheus with sudo systemctl restart prometheus.
    - job_name: 'nodes'
      static_configs:
        - targets: ['ip:9100']
3) Metrics to collect: disk usage (/, /data), inode usage, disk latency (from iostat, or derived from node_exporter counters such as node_disk_read_time_seconds_total and node_disk_reads_completed_total), network bandwidth, CPU, memory, disk queue length, and number of open file handles.
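As a rough illustration of what the disk and inode usage items measure, the Python sketch below reads the filesystem counters directly. The function names usage_percent and filesystem_usage are illustrative and not part of node_exporter. The calculation uses the space available to unprivileged users, so root-reserved blocks count as used and the result can differ slightly from df output.

```python
import os


def usage_percent(total, available):
    """Percentage in use, given total and available units (bytes or inodes)."""
    if total == 0:
        # Some filesystems report zero inodes; treat that as "not applicable".
        return 0.0
    return (total - available) * 100 / total


def filesystem_usage(path):
    """Return (space_used_pct, inodes_used_pct) for the filesystem holding path."""
    st = os.statvfs(path)
    space = usage_percent(st.f_blocks * st.f_frsize, st.f_bavail * st.f_frsize)
    inodes = usage_percent(st.f_files, st.f_favail)
    return space, inodes
```

Calling filesystem_usage("/data") on a storage node returns both percentages for that mount, which can be compared against the alert thresholds used later in this article.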
3. Object storage and gateway monitoring
Steps: 1) For S3-compatible storage, enable access logging on the storage side and deliver the logs to a dedicated bucket. Parse them with Fluentd or Fluent Bit, then either expose the derived metrics for Prometheus to scrape or send the logs directly to Elasticsearch.
2) Key indicators: PUT/GET 4xx and 5xx rates, p95/p99 response latency, sharding/replication lag, object count growth rate, and the number of lifecycle hot/cold transitions.
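To make the key indicators concrete, here is a minimal Python sketch that derives error rates and p95/p99 latency from parsed access-log records. The record layout (dicts with "status" and "latency_ms" keys) is an assumption for illustration; real access-log fields depend on the storage product and on how the log parser is configured. The percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records):
    """Summarize hypothetical parsed records: {"status": int, "latency_ms": number}."""
    total = len(records)
    if total == 0:
        raise ValueError("no records")
    count_4xx = sum(1 for r in records if 400 <= r["status"] < 500)
    count_5xx = sum(1 for r in records if 500 <= r["status"] < 600)
    latencies = [r["latency_ms"] for r in records]
    return {
        "rate_4xx": count_4xx / total,
        "rate_5xx": count_5xx / total,
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```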
4. Alarm rules and threshold recommendations (example)
Example Prometheus rules (written with shorthand metric names): 1) disk_usage_percent > 80 for 5m → warning; > 90 for 2m → critical.
2) inode_usage > 90% for 5m. 3) disk_io_avg_latency_ms > 50 ms for 5m. 4) s3_5xx_rate > 0.5% for 10m.
Rule writing reference (fires when less than 20% of /data is free, that is, usage above 80%):
    - alert: DiskAlmostFull
      expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) * 100 < 20
      for: 5m
      labels:
        severity: warning
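The "for" durations in these rules mean the condition must hold continuously before the alert fires. The Python sketch below is a simplified model of that behavior, not how Prometheus is implemented: it assumes evenly collected (timestamp, value) samples and treats the last sample as "now". The names fires and disk_severity are illustrative.

```python
def fires(samples, threshold, hold_seconds):
    """True if value > threshold for every sample over the trailing hold_seconds.

    samples: list of (unix_timestamp, value) pairs in time order.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
        else:
            pending_since = None  # any dip resets the pending timer
    if pending_since is None:
        return False
    return samples[-1][0] - pending_since >= hold_seconds


def disk_severity(samples):
    """Apply the example disk thresholds: >90 for 2m is critical, >80 for 5m is warning."""
    if fires(samples, 90, 120):
        return "critical"
    if fires(samples, 80, 300):
        return "warning"
    return None
```

A single sample that drops back under the threshold resets the timer, which is why short spikes do not page anyone under these rules.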
5. Alarm routing and receiver configuration
Steps: 1) Configure routes in Alertmanager: route alerts to Slack, email, PagerDuty, or SMS based on severity, team, and service labels.
2) Configure notification templates and suppression (inhibition rules and silences): for example, short-lived I/O spikes can be silenced for 15 minutes.
3) Test the pipeline: use amtool or curl to fire a simulated alert and confirm that both the primary and CC'd recipients receive it.
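As an alternative to curl for the test step, the sketch below builds a synthetic alert and posts it to Alertmanager's v2 HTTP API (POST /api/v2/alerts). The label values, the helper names build_test_alert and send_alert, and the base URL http://localhost:9093 in the usage note are assumptions; substitute the labels your routing tree actually matches on.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(name, severity, team, minutes=5):
    """Build a one-element alert list that expires after the given minutes."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": severity, "team": team},
        "annotations": {"summary": "Synthetic alert for routing test"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alert(base_url, alerts):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

For example, send_alert("http://localhost:9093", build_test_alert("RoutingTest", "warning", "storage")) should deliver a notification to whichever receiver handles warning-level alerts for that team.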
6. Alarm handling (runbook) and quick handling commands
General process: receive the alert → log in to the affected host → check top, df -h, iostat, and vmstat → determine whether the growth is a sudden spike or a long-term trend.
Free up space quickly: 1) trim the systemd journal under /var/log: sudo journalctl --vacuum-time=3d; 2) clean temporary directories: sudo rm -rf /tmp/* (first confirm that no running service depends on files there); 3) delete old backups or move them to cold storage (example: aws s3 mv /backup s3://cold-bucket/ --recursive --storage-class GLACIER).
Temporary capacity expansion: attach and mount a new disk, rsync the data to it, and update /etc/fstab.
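Before deleting anything, it helps to know what is actually consuming the space. The following Python sketch lists the largest regular files under a directory; the function name largest_files is illustrative, and the walk can be slow on very large trees, so treat it as a starting point rather than a production tool.

```python
import heapq
import os


def largest_files(root, top_n=10):
    """Return up to top_n (size_bytes, path) tuples, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks so targets are not counted twice
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    return heapq.nlargest(top_n, sizes)
```

Running largest_files("/data", 20) during an incident gives a quick shortlist of candidates to archive or remove.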
7. Capacity planning steps (detailed how-to guide)
1) Data collection: export daily used_bytes, object_count, and daily_ingest_bytes for the past 90-180 days; you can export CSV from Prometheus or from a cloud monitoring API (such as AWS CloudWatch).
2) Calculate the daily growth rate: fit a linear regression, or take the average daily increment over the last 30 days: (last - first) / days.
3) Forecast and safety factor: take the 95th-percentile forecast to cover business peaks, then add 20%-30% headroom (up to 50% for critical workloads).
4) Define a retention and tiering policy: for example, keep data in hot storage for 30 days and in cold storage for 90-365 days, and enable lifecycle rules for automatic transition. Document the policy and register it in the CMDB.
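The regression and safety-factor steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the input is one used_bytes sample per day with no gaps, growth is roughly linear, and the function names linear_fit and required_capacity are hypothetical. It does not model the percentile-based peak adjustment.

```python
def linear_fit(values):
    """Least-squares slope and intercept for values observed on days 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def required_capacity(used_bytes_daily, horizon_days, redundancy=0.25):
    """Forecast usage horizon_days past the last sample, plus redundancy headroom."""
    slope, intercept = linear_fit(used_bytes_daily)
    target_day = len(used_bytes_daily) - 1 + horizon_days
    forecast = intercept + slope * target_day
    return forecast * (1 + redundancy)
```

With 90 days of history and horizon_days=180, the result is the capacity to provision so that the fitted trend plus 25% headroom still fits six months out.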
8. Capacity expansion operation (block storage/cloud disk and file system)
Cloud disk expansion (using AWS as an example): 1) aws ec2 modify-volume --volume-id vol-xxx --size 200 --region ap-southeast-1.
2) Check on the instance with sudo lsblk. If the partition needs to grow: sudo growpart /dev/xvdf 1. Then grow the file system: for XFS, sudo xfs_growfs /mountpoint; for ext4, sudo resize2fs /dev/xvdf1. Device names vary by instance type, so use the name that lsblk reports.
Add a new disk and migrate: mount the new disk → rsync -av /data/ /mnt/newdata/ → update /etc/fstab → restart the service and switch over gradually.
9. Q&A 1
Question: How can false 5xx alarms for object storage in the Singapore region be avoided?
Answer: The key is to combine a percentage threshold with short-term suppression. Use the 5xx request rate (5xx_count / total_requests) as the indicator and alert only when it stays above a threshold such as 0.5% for 10 minutes. Also suppress false alarms caused by brief deployments (silence alerts while deploy_tag=true), and correlate with request latency and back-end error rate to decide whether the fault is real.
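The decision logic in this answer can be sketched as below. The per-minute window layout (dicts with "total", "errors_5xx", and "deploying" keys) and the function name should_alert are assumptions for illustration; in practice this logic would live in the alert rule and in Alertmanager silences rather than in a script, and the latency correlation step is left out.

```python
def should_alert(windows, threshold=0.005, min_windows=10):
    """Alert only if the 5xx rate exceeded threshold in each of the last min_windows.

    windows: per-minute dicts, oldest first, with hypothetical keys
    "total" (requests), "errors_5xx" (5xx responses), "deploying" (bool).
    """
    if len(windows) < min_windows:
        return False  # not enough history to satisfy the duration requirement
    for w in windows[-min_windows:]:
        if w["deploying"]:
            return False  # suppress while a deployment is in progress
        if w["total"] == 0:
            return False  # no traffic, so no meaningful rate
        if w["errors_5xx"] / w["total"] <= threshold:
            return False  # rate recovered at some point in the window
    return True
```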
10. Q&A 2
Question: Which historical window gives more accurate capacity forecasts?
Answer: A window of 90 to 180 days is usually used, because it captures both seasonality and recent trends. For fast-growing businesses, calculate the 30-day and 90-day growth rates in parallel, take the more conservative value, and keep 20%-30% headroom. Adjust the forecast temporarily when promotions or migration windows are planned.
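A small sketch of the parallel calculation follows. It assumes one sample per day and interprets "conservative" as the larger of the two growth rates, since planning for faster growth is the safer choice; that interpretation and the function names growth_rate and conservative_growth are assumptions.

```python
def growth_rate(values, window):
    """Average daily increment over the last `window` days: (last - first) / days."""
    recent = values[-(window + 1):]
    if len(recent) < 2:
        raise ValueError("need at least two daily samples")
    return (recent[-1] - recent[0]) / (len(recent) - 1)


def conservative_growth(values, redundancy=0.25):
    """Larger of the 30-day and 90-day rates, scaled up by the redundancy factor."""
    rate = max(growth_rate(values, 30), growth_rate(values, 90))
    return rate * (1 + redundancy)
```

If fewer than 91 samples are available, growth_rate simply uses the whole series for the 90-day figure.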
11. Q&A 3
Question: What should the first step be when a disk suddenly triggers a high I/O alarm?
Answer: First check the traffic and the processes behind it: log in to the host and run iostat -x 1 5, iotop, and ps aux --sort=-%cpu to determine whether a backup, scan, or batch job is responsible. If it is an expected task, throttle it or reschedule it first. If it is an abnormal write, find the process generating the large files and temporarily stop that service. If necessary, move the hot data to a cold (lightly loaded) disk.
